policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
## Reading layer `OGRGeoJSON' from data source
## `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON'
## using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS: WGS 84
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
## Reading layer `OGRGeoJSON' from data source
## `https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON'
## using driver `GeoJSON'
## Simple feature collection with 277 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS: WGS 84
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
drug <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "NARCOTICS") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
## Reading layer `chicagoBoundary' from data source
## `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson'
## using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS: WGS 84
drug <- st_intersection(drug, chicagoBoundary)
The maps below show the spatial distribution and density of reported narcotics in Chicago. While located almost all over the city except for the south end that is an international golf center, narcotics are more concentrated in the South Side and a few neighborhoods in the west part of the city.
Narcotics is a crime type that is highly exposed to selection bias. To begin with, whether one calls the police for an incident of possessing or selling drugs can be highly related to the type of neighborhood; in a neighborhood that takes a liberal stand to drugs as a common sight, one might not even bother to report narcotics. Moreover, narcotics can receive selective law enforcement as previous research finds that both white and black communities have similar amount of drug usage but neighborhoods of POC experience a disproportionately high amount of law enforcement regarding to narcotics with police policies such as stop-and-frisk. Thus, the relatively low density of narcotics in the North Side that is whiter and richer than the South Side might not necessarily due to less incidents of narcotics but due to the selection bias in reporting and law enforcement.
Yet selection bias is a real challenge that risk prediction cannot ignore for the risk levels predicted by the model can have an influence on policy making. For example, if the selection bias in narcotics is likely to over-document incidents in black/brown neighborhoods while under-document incidents in white neighborhoods, the model trained with biased data is likely to perpetuate the bias against the minority communities and further harm the community with over-policing. Thus, this project attempts to hold bias accountable as much as possible. However, the logic of the modeling for this project is based on Broken Window Theory that has been highly controversial for positing a link between disorder and crime, and the indicators for disorder–such as graffiti and abandoned cars–are generated through 311 open data sets that are also exposed to selection bias of reporting disorder. While the model is unlikely to be useful without bias, this project provides an opportunity to reflect on the bias that are in the data and processed through the modeling process.
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = drug, colour="red", size=0.1, show.legend = "point") +
labs(title= "Narcotics, Chicago - 2017") +
mapTheme(title_size = 14),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(drug)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.55), guide = FALSE) +
labs(title = "Density of Narcotics") +
mapTheme(title_size = 14) + theme(legend.position = "none"))
The map below shows the count of narcotics by the fishnet that divides Chicago into 500ft by 500ft grid cells. The map confirms the clustering of narcotics in the South Side and the extremely high concentration in some north-west neighborhoods.
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # <- MDH Added
st_sf() %>%
mutate(uniqueID = rownames(.))
crime_net <-
dplyr::select(drug) %>%
mutate(countDrug = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countDrug = replace_na(countDrug, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = countDrug), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Narcotics for the fishnet") +
mapTheme()
The map below incorporates the fishnet map into the base map of Chicago. The map shows that there a a few cells with extremely high count of narcotics that exceed other cells by a far. This clustering is likely to skew the modeling.
mapview(crime_net, zcol = "countDrug")
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%
filter(business_activity == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
garbageCarts <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Garbage_Carts")
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
## Reading layer `chicago' from data source
## `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson'
## using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS: WGS 84
The map below shows the count of risk factors by the fishnet. Abandoned buildings are more concentrated in the South Side while abandoned cars are more concentrated in the North Side. Graffiti clusters along the transportation corridors of the 90, 290, and 55 highways. Liquor retails clusters in the Loop and the North Side. Open garbage Carts are distributed across the city but with a lower density in the east and along highway corridors. Sanitation is more concentrated in the North Side. Street Lights that are out are almost evenly distributed in the city.
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail, graffiti, sanitation, garbageCarts)%>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet, by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol=3, top="Risk Factors by Fishnet"))
The map below shows the nearest neighbor distances of the risk factors by the fishnet.
vars_net <-
vars_net %>%
mutate(
Abandoned_Buildings.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
Abandoned_Cars.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
Graffiti.nn =
nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
Liquor_Retail.nn =
nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
Street_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
Sanitation.nn =
nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
Garbage_Carts.nn =
nn_function(st_c(st_coid(vars_net)), st_c(garbageCarts),3))
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor Risk Factors by Fishnet"))
The map below shows the distance to key neighborhoods: the Loop and Garfield Park. This project calculates the distance to Garfield Park because it is known for narcotics activities.
loopPoint <-
filter(neighborhoods, name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
Garfield_Park_Point <-
filter(neighborhoods, name == "Garfield Park") %>%
st_centroid()
vars_net$Garfield_Park_Distance =
st_distance(st_centroid(vars_net),Garfield_Park_Point) %>%
as.numeric()
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
vars_net.long.dis <-
dplyr::select(vars_net, ends_with("Distance")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.dis$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.dis, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol = 3, top = "Distance Risk Factors by Fishnet"))
The maps below show the results from the calculation for the Local Moran’s I. Local Moran’s I is used to test for the spatial autocorrelation to a cell’s immediate eight neighbors. The test generates two significant hot spot for narcotics.
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
local_morans <- localmoran(final_net$countDrug, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Drug_Count = countDrug,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 14) + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Narcotics"))
This project then calculates the nearest neighbor feature of cells to the nearest hot spot. The map below shows the distance to the nearest hot spot by the fishnet.
final_net <- final_net %>%
mutate(drug.isSig =
ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
mutate(drug.isSig.dist =
nn_function(st_c(st_coid(final_net)),
st_c(st_coid(filter(final_net,
drug.isSig == 1))),
k = 1))
ggplot() +
geom_sf(data = final_net, aes(fill=drug.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Narcostics NN Distance") +
mapTheme()
Plots below show the correlation between risk factors and narcotics counts by the fishnet. Importantly, most risk factors show a small correlation with narcotics counts. It is likely that the selection bias for narcotics severely skew the data, and the Broken Window Theory that serves as the foundation of this project might fail to capture the relationship between neighborhood characteristics and crime.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -loopDistance, -Garfield_Park_Distance, -name, -District) %>%
gather(Variable, Value, -countDrug)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countDrug, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countDrug)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Narcotics count as a function of risk factors") +
plotTheme()
The plot below shows the count of fishnet cells by the count of narcotics in each cell. The distribution of the dependent variable shows a Poisson distribution of counts. There are a handful cells with extremely high number of narcotics incidents, and they can propose serious challenges for the model to hold the outliers accountable.
ggplot(final_net, aes(countDrug)) +
geom_histogram(binwidth = 1) +
labs(title = "Narcotics Distribution")
The following section creates two regression models that predict the risk of narcotics. The base model includes only risk factors that are selected based on the correlation test; the spatial model includes both risk factors and the spatial process of clustering calculated by the Local Moran’s I.
reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "Garbage_Carts",
"loopDistance", "Garfield_Park_Distance")
reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "Garbage_Carts",
"Garfield_Park_Distance", "loopDistance", "drug.isSig", "drug.isSig.dist")
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countDrug",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countDrug, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countDrug",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countDrug, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countDrug",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countDrug, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countDrug",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countDrug, Prediction, geometry)
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countDrug,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countDrug,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countDrug,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countDrug,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countDrug, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
The maps below shows the model errors by random k-fold and spatial cross validation. Both models have relatively high errors for the South Side and north-east neighborhoods along the 90 highway corridor. For both models, the error for North Lawndale is extremely high. North Lawndale is right south to Garfield Park and is specified as a hot spot, but the spatial process of hot spots calculated by the Local Moran’s I still could not explain the high count of narcotics in this neighborhood. The weak prediction for North Lawndale might due to the selection bias that current variables cannot hold the results accountable for.
map_name<-unique(error_by_reg_and_fold$Regression)
map_list<-list()
for(i in map_name){
map_list[[i]] <-
ggplot() +
geom_sf(data = filter(error_by_reg_and_fold, Regression == i),
aes(fill=MAE), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(map_list, ncol=2, top="model errors by random k-fold and spatial cross validation"))
The table below shows the MAE and Standard Deviation MAE by regression. While two models show little difference with random k-fold cross validation, the spatial model significantly reduces MAE with the LOGO cross validation.
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(4, color = "black", background = "#FDE725FF")
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 1.74 | 1.61 |
| Random k-fold CV: Spatial Process | 1.66 | 1.72 |
| Spatial LOGO-CV: Just Risk Factors | 2.93 | 3.77 |
| Spatial LOGO-CV: Spatial Process | 1.89 | 2.50 |
The table below shows the raw errors by race context for random k-fold and spatial cross validation regression. The spatial model shows a lower error across race context. Both models under-predicts majority-non-white neighborhoods and over-predicts majority-white neighborhoods.
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2017, state=17, county=031, geometry=T) %>%
st_transform('ESRI:102271') %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
reg.summary %>%
filter(str_detect(Regression, "LOGO")) %>%
st_centroid() %>%
st_join(tracts17) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
The map below shows the comparison between kernel density to risk prediction for the next year’s crime. Kernel density predicts a larger area of high risk over 90% especially around the hot spot in the north-west of Chicago.
drug18 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>%
filter(Primary.Type == "NARCOTICS") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
drug_ppp <- as.ppp(st_coordinates(drug), W = st_bbox(final_net))
drug_KD.1000 <- spatstat.core::density.ppp(drug_ppp, 1000)
drug_KD.1500 <- spatstat.core::density.ppp(drug_ppp, 1500)
drug_KD.2000 <- spatstat.core::density.ppp(drug_ppp, 2000)
drug_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(drug_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(drug_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(drug_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
drug_KD.df$Legend <- factor(drug_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
drug_KDE_sf <- as.data.frame(drug_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(drug18) %>% mutate(drugCount = 1), ., sum) %>%
mutate(burgCount = replace_na(drugCount, 0))) %>%
dplyr::select(label, Risk_Category, drugCount)
drug_risk_sf <-
reg.ss.spatialCV %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(drug18) %>% mutate(drugCount = 1), ., sum) %>%
mutate(burgCount = replace_na(drugCount, 0))) %>%
dplyr::select(label,Risk_Category, drugCount)
rbind(drug_KDE_sf, drug_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(drug18, 3000), size = .5, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2017 drug risk predictions; 2018 drug") +
mapTheme(title_size = 14)
The bar plot below shows the comparison of accuracy of kernel density to risk prediction for the next year’s crime. Risk prediction model does a better job of prediction in all risk categories except for the 70%-89% one. The result is mixed to interpret as risk prediction model might address more of the bias issue for the most risky neighborhoods but might perform worse then kernel density for the second risky neighborhoods that can still be exposed to great bias.
rbind(drug_KDE_sf, drug_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countDrug = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_drug = countDrug / sum(countDrug)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_drug)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Risk prediction vs. Kernel density, 2018 drugs") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
I would not recommend my algorithm be put into production. While the risk prediction model overperforms the traditional kernel density model in all risk categories except the 70%-89% one, there is still too much selection bias for this model to be put into production. While my model attempts to hold bias accountable, the extreme outlier of North Lawndale indicates that there are still important factors that my model does not consider to address the clustering of narcotics. The underprediction of hot spots such as North Lawndale can lead to serious short-handed law enforcement. Thus, to improve this model might require a revision on the theory that the model is based on. Future research can focus on what other links exist between crime and neighborhood types besides “disorder” that the Broken Window Theory highlights.
More importantly, as I’ve discussed in previous paragraphs, narcotics is one of the most biased crime types that is significantly influenced by the selection bias in reporting and law enforcement. With the selection bias in the observations to begin with, any risk prediction model that attempts to predict narcotics is likely to fail due to the heavily skewed training data. While I believe that risk prediction models can be useful for other crime types such as muder or fire that face a lower selection bias, prediction on narcotics might need to take another path instead of modeling based on existing quantitative data.